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Andrey Krokhotin, 1 5 [^] Martin Lundgren, 1 '^ and Antti J. Niemi 1, 2 |] 

Department of Physics and Astronomy, Uppsala University, P.O. Box 803, S-75108, Uppsala, Sweden 
2 Laboratoire de Mathematiques et Physique Theorique CNRS UMR 6083, 
Federation Denis Poisson, Universite de Tours, Pare de Grandmont, F37200, Tours, France 

The enterobacteria lambda phage is a paradigm temperate bacteriophage. Its lysogenic and lytic 
life cycles echo competition between the DNA binding A-repressor (CI) and CRO proteins. Here we 
scrutinize the structure, stability and folding pathways of the A-repressor protein, that controls the 
transition from the lysogenic to the lytic state. We first investigate the super-secondary helix-loop- 
helix composition of its backbone. We use a discrete Frenet framing to resolve the backbone spectrum 
in terms of bond and torsion angles. Instead of four, there appears to be seven individual loops. 
We model the putative loops using an explicit soliton Ansatz. It is based on the standard soliton 
profile of the continuum nonlinear Schrodinger equation. The accuracy of the Ansatz far exceeds 
the B-factor fluctuation distance accuracy of the experimentally determined protein configuration. 
We then investigate the folding pathways and dynamics of the A-repressor protein. We introduce a 
coarse-grained energy function to model the backbone in terms of the C a atoms and the side-chains 
in terms of the relative orientation of the C/3 atoms. We describe the folding dynamics in terms 
of relaxation dynamics, and find that the folded configuration can be reached from a very generic 
initial configuration. We conclude that folding is dominated by the temporal ordering of soliton 
formation. In particular, the third soliton should appear before the first and second. Otherwise, 
the DNA binding turn does not acquire its correct structure. We confirm the stability of the folded 
configuration by repeated heating and cooling simulations. 



PACS numbers: 05.45.Yv 87.15.Cc 36.20.Ey 

I: INTRODUCTION 

The transition between the lysogenic and the lytic 
state in bacteriophage A infected E. coli cell is the 
paradigm genetic switch mechanism. It is described in 
numerous molecular biology textbooks and review arti- 
cles pQ -[7]. The interplay between the lysogeny main- 
taining A-repressor (CI) protein and the CRO regulator 
protein that controls the transition to the lytic state is 
a simple model for more complex regulatory networks, 
including those that can lead to cancer in humans. 

In the present article we describe the physical prop- 
erties of the A-repressor protein, that controls the 
lysogenic- to-ly tic transition. We investigate in detail the 
stability of its native conformation, the dynamics of the 
folding process, and the landscape of folding pathways. 
We find that the folded configuration displays a struc- 
ture which is unique among all known protein structures. 
We also conclude that the folding pathways are entirely 
dominated by the loop regions. In particular, the tempo- 
ral ordering of loop formation appears to be the decisive 
factor for the protein's ability to reach its native fold. If 
solitons form in a wrong order the protein may misfold. 

Full crystallographic information of the experimental 
A-repressor structure that we use in our investigation is 
available in Protein Data Bank (PDB) [8 under the code 
1LMB. This structure is a homo-dimer with 92 residues 
in each of the two monomers. It maintains the lysogenic 
state by binding to DNA with a helix-turn-helix motif 
that is located between the residue sites 33-51. Through- 
out this article we shall use the PDB indexing of the 



residues. 

For the statistical analyses that are presented here, we 
utilize a subset of PDB data that consists of the canonical 
set of structures with better than 2.0 A resolution. We 
have compared the results with the subset that contains 
only those proteins with better than 2.0 A resolution 
and with less than 30% sequence similarity. Our conclu- 
sions are independent of the data set, and for illustrative 
purposes we here use the canonical 2.0 A set. 

This article is composed as follows: We first explain 
how to describe the geometry of a generic folded protein 
in terms of its backbone central C a carbons. We propose 
that a coarse-grained energy function, that models the 
backbone geometry, can be constructed with only the 
C a coordinates as dynamical variables. We argue that 
a variant of the discrete non-linear Schrodinger (DNLS) 
equation is a suitable Master Equation to describe folded 
proteins, in terms of its dark soliton solution. We then 
proceed to utilize this general framework to study the 
structure of the A-repressor protein. We show that the 
A-repressor backbone is composed from seven individual 
soliton solutions of the DNLS equation, within the ac- 
curacy of crystallographic structure measurements. In 
the same vein we propose, that protein folding can be 
described in terms of a coarse grained model, based on 
relaxation dynamics. We utilize this to investigate the 
folding dynamics of the A-repressor. We conclude that 
the temporal ordering of soliton formation is important 
for reaching the correct native state. A wrong ordering 
in soliton formation can be a cause for misfolding. We 
observe that the second soliton has a peculiar structure 
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that sets it apart from any other known structure in all 
proteins. 



II: METHODS: 
A: Backbone geometry 

Our analysis of the A-repressor protein will be based 
on an effective, coarse grained energy function approach 
that has been recently developed in [9] -[33]. In this ap- 
proach the protein geometry is described in terms of the 
backbone C a atoms. The ensuing bond and torsion an- 
gles assume the role of the dynamical variables. These 
angles are constructed as follows: Let be the coordi- 
nate sites of the C a carbons, where the index i = 1, N 
runs over all amino acids. For a given protein these co- 
ordinates can be read from the PDB. For each site z, we 
introduce the unit tangent vector 



the unit binormal vector 



i+l 



|ti-i — t{ 
and the unit normal vector 

n ? ; = b ? ; X t 7 ; 



(1) 



(2) 



The orthogonal triplet (n^,b^,t^) determines a discrete 
version of the Frenet frame at each position of the 
backbone. 

The backbone bond angles are 

Ki = = arccos (t i+ i ■ ti) (4) 

and the backbone torsion angles are 

Ti = Ti+\ y i — sign{bi_i x -t^}- arccos (b i+ i • b^) (5) 

Conversely, if these angles are all known, we can use 
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to iteratively construct the frame at position i + i from 
the frame at position i. Once we have all the frames, we 
obtain the entire backbone from 



rfc 



fc-i 



i+l 



(7) 



Without any loss of generality we may set i*o = 0, and 
choose to to point into the direction of the positive z-axis. 



We note that the relation ^ does not involve the vec- 
tors and b^. Consequently we may rotate the (n^b^) 
frame vectors, without affecting the backbone, by select- 
ing an arbitrary linear combination of these two vectors 
independently at each site i. For this we introduce a lo- 
cal SO(2) transformation that rotates the (n^b^) by an 
angle so that the t{ remain intact, 



cosAj sinAj 0\ /n> 
— sin Aj cos Aj b 
1/ It. 



(8) 



where the SO (3) generators are (T l )jk = e^, 

[T\Ti] = e ijk T k 
We combine n and b into the complex vector 

n + ih 

and rewrite ([8| as 

+ ibi -+ e iAl (m + ibi) = ej + ie* (9) 

The frame rotation ([8| corresponds to the following 
transformation in the bond and torsion angles, 



m T 2 -> e AiT3 ( Ki T 2 )e- AiT3 



(10) 



(11) 



(3) Since the transformation (10), (11) leaves ti intact it has 



no effect on the backbone. 

A priori, the fundamental range of the bond angle Ki 
is Ki G [0,7r]. For the torsion angle the range is Ti £ 
[— 7r,7r). Consequently we may identify (^,7$) with the 
canonical latitude and longitude angles of a two-sphere 
S 2 . However, in the sequel we find it useful to extend the 
range of Ki into [— 7r,7r] mod(27r), but with no change in 
the range of T{. We compensate for this two- fold covering 
of S 2 by introducing the following discrete Z2 symmetry 



K k -+ 
n -+ 



for all k > i 



(12) 



We note that this is a special case of (10), (11), with 



A k =7T 

A fe = 



for k > i + 1 
for k < i + 1 



The regular protein secondary structures correspond to 
definite values of (^,r^). For example standard a-helix 
is 



a — helix : 
and standard /3-strand is 

P — strand : 



r « 1 



k « 1 



(13) 



(14) 
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Similarly we can describe all the other regular secondary 
structures such as 3/10 helices, left-handed helices etc. 
with definite constant values of Ki and ti . Loops are con- 
figurations that interpolate between these regular struc- 
tures, so that along a loop the values of (^,r^) are vari- 
able. 

Finally, we compute the average value of the bond 
length in ^ using PDB. The result shown in Figure 1 is 
constructed using the canonical set of protein structures 
which are measured with better than 2.0 A resolution. 
The average value is concentrated around 



= 3.8 A 



(15) 



In our theoretical analysis we use the fixed bond length 



value (15). We also impose the forbidden volume (steric) 
constraint 



|ri - r k \ > 3.8 A for \i - k\ > 2 



(16) 



between the backbone C a atoms. This condition is well 
respected by folded protein structures in PDB. 



B: Side-chain geometry 

Following [13] we characterize the side-chain directions 
in terms of directional vectors that point from the C a 
towards the corresponding Cp carbon. At each C a 



u ? = 



' sin K)i • cos ipi 1 
sin Qi ■ sin (fi 



(17) 



cos Oj 



The latitude angle 0i counts deviation from the direc- 
tion of the corresponding Frenet frame tangent vector t$. 
When Qi = the ti and are parallel. Note that the 
angle 0i remains invariant under the rotation (ISj) . We 



FIG. 2: (Color online) In the Frenet frames the fluctuations 
in the direction of the vectors around their average value, 
given by ( 18 ), ( 20 ) and denoted by the (red) lateral vector, are 
relatively small. The left-hand side of the horse-shoe corre- 
sponds to /3-strands, the right-hand side to a-helices and the 
connecting region at the bottom corresponds to loops. The 
minuscule isolated island corresponds to the "left-handed a" 
region. 



can compute the values of 0i from the PDB. As shown in 
Figure 1 the range of variations in 0^ are quite small, it 
fluctuates around 



< > 



1.98 (rad) 



(18) 



The longitude ipi in (17) measures distance from the 
direction of the Frenet frame normal vector n^. It is the 
azimuthal angle between and the projection of on 
the normal plane spanned by (n^b^). Under the frame 
rotation (fsl) we have 



Pi (fi 



(19) 



and consequently the values of (fi depend on the framing. 
From PDB we find that in the Frenet frames the values 
of (fi are subject to relatively small fluctuations around 
the average value 



< (f > 



-2.43 (rad) 



(20) 



As shown in Figure 2, it is remarkable that the direc- 
tion of nutates in a very regular horse-shoe shaped 
manner, that reflects the underlying secondary structure 
environment. This proposes that there is a strong local 
coupling between the two angular variables 0{ and 
that depends on the secondary structure. 

The notable expection from (18), (20) is the left- 



handed loop region. It is visible in Figure 2 as a mi- 
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nuscule isolated region, with 



< > 

< If > r r 



2.25 (rad) 
-1.90 (rad) 



(21) 



But this is also quite close to the values in (18), (20). 



C: Backbone energy and solitons 

Proteins display a hierarchy which is determined by 
the spatial length scale. As the length scale increases, 
shorter distance dynamical variables become gradually 
disengaged. Following the general concept of universal- 
ity [E]-[T7], we utilize this hierarchy of scales to system- 
atically coarse grain the energy function. At each level 
of hierarchy we retain explicitly only those variables that 
remain relevant. The variables that are less and less so as 
the distance scale increases are accounted for effectively, 
through the functional form and the values of parame- 
ters in the various individual energy contributions that 
involve the remaining relevant variables only. 

In [9] -[13], see also [18] . it has been argued that a Lan- 
dau free energy function that aims to compute the over- 
all fold geometry can be based solely on those variables 
that determine the positions of the central C a atoms. 
Since the fluctuations in the bond lengths are minimal, 
see Figure 1, the leading order contribution to the en- 
ergy then involves only the bond and torsion angles for 
the C a backbone. The functional form of the energy 
can be uniquely determined by symmetry considerations. 
For this we note that any backbone energy function that 
involves only the bond and torsion angles must remain 
invariant under the SO (2) transformation ( [To|, ( [TT] ). In- 
deed, previously it has already been shown [9 -[13 how 
this SO (2) gauge invariance allows us to uniquely deduct 
the functional form of the energy function, in the long 
distance limit where any higher order non-local contri- 
bution can be ignored. In this limit the energy function 
can only involve the following terms [9]-[TT], 



E ■ 
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b r KiTi 



a T Ti 



(22) 



The detailed derivation of (22) is presented in [9 . It in- 



volves the introduction of frame (gauge) invariant com- 
binations of the Frenet frame bond and torsion angles 
[18], [19 . For the present purposes it suffices to observe, 
that (22) coincides with the one dimensional discretized 



Abelian Higgs Model Hamiltonian in the unitary gauge, 
in terms of the Frenet frame bond and torsion variables. 

We can also recognize ( [22] ) as a variant of the discrete 
nonlinear Schrodinger (DNLS) equation [20]: The first 



sum together with the three first terms in the second sum 
comprise exactly the energy of the standard DNLS equa- 
tion when expressed in terms of the Hasimoto variable of 
fluid mechanics [TT], [20]. The fourth (b T ) is a conserved 
quantity in the DNLS hierarchy, the "momentum", and 
the fifth term (a T ) is the conserved "helicity". The last 
(c r ) term is the (non-conserved) Proca mass term that 
we include for completeness. 



The energy function (22) does not purport to explain 



the fine details of the atomary level mechanisms that give 
rise to protein folding. Rather, it examines the proper- 
ties of a folded protein backbone in terms of universal 
physical arguments along the lines of [T3]-[T7]. Indeed, 
the functional form (22) is deeply anchored in the el- 
egant mathematical structure of integrable hierarchies 
[20] . Within this framework no term beyond those in the 
integrable hierarchy can be added without violating the 
underlying general and elegant mathematical principles. 



In this sense, (22 ) is the universal long distance limit that 



would emerge from any microscopic level Schrodinger op- 
erator, whenever we truncate the Landau free energy to 
explicitly include only the backbone bond and torsion 
angles. 

Remarkably, we have found that despite the very gen- 
eral nature of argumentation that leads us to adopt 
the energy function (22), it is fully capable of describ- 



ing folded protein backbones with sub-atomic precision 
of around 0.5 A, and even better [21]. This is due to 



the observation [TO], [TT], that (22) describes proteins in 
terms of solitons that are the paradigm structural self- 
organizers. Indeed, solitons are tremendously robust in 
their ability to preserve the form under both quantum 
mechanical and thermal fluctuations. 

We derive the relevant soliton profile as follow: We 
first introduce the r-equation of motion 



dE 







from which we solve 



a T + b T n* 



(23) 



Even though there are four parameters in (23) one of 



them, the overall scale, cancels out. In the sequel we 
shall choose a T = —1.0 so that for an a-helix (13) we 
have 



n\a\ 



b T n\ 



d T K\ 



and for a /3-strand (|14) 
1 



b T K\ 



+ d T Kf 



1 mod (2tt) 



7r mod (27r) 



When we use (23) to eliminate the torsion angles we 
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get for the bond angles the energy 



N-l 



N 



E[k] = - 2 K i+l*i + Yl 2 ^ + V ^ ( 24 ) 



where 



V[k] = 



Sqm 2 



1 



2b T 



q • k 



(25) 



The functional form of (25) is familiar from various stud- 
ies in mathematical physics: The first term relates to 
the potential energy in a Calogero-Moser system. The 
second and third terms have the conventional form of a 
symmetry breaking double-well potential. Depending on 
the parameter values we may be either in the broken sym- 
metry phase where n and r both acquire a non- vanishing 
and constant ground state value, or in the symmetric 
phase where k vanishes. 

In applications to proteins, regular structures such as 
helices (13) and strands (14) correspond to different bro- 



ken symmetry ground states of the energy. Furthermore, 



the numerical value of the first term in ( 25 ) is often small 
in comparison to the second and third, and so is b\ in 
comparison to 8qm 2 so that for an a-helix (13) 



m • 



and for a /3-strand 



7T 

2 



1.0 



(26) 



(27) 



In [TO] . [TT] it has been shown that loops, which are re- 
gions where (^,r^) are variable, can be constructed in 
terms of the dark soliton solution of the generalized dis- 
crete nonlinear Schrodinger equation that derives from 



the energy (24) 



dV[K] 
da? 



(i = l,...,N) (28) 



where we set = ^at+i = 0. This is the Master equation 
that we use to compute the shape of a folded protein C a 
backbone. 



D: Soliton Ansatz 

We do not know the explicit form of the dark soli- 
ton solution to the present discrete nonlinear Schrodinger 
equation. But a numerical approximation can be easily 
constructed using the procedure described in [TT]. Fur- 
thermore, since it turns out that in the case of proteins 



the first term in (25) is small, an excellent approximation 



[T2] is given by the naive discretization of the continuum 
dark NLSE soliton [20], 



(/xi+2ttJVi) 



po-i(i-s) 



(/i 2 + 2nN 2 ) • e 



-a 2 (i-s) 



" l e cr\{i — s) _|_ e —o- 2 (i—s) 

(29) 

Here s is a parameter that determines the backbone site 
location of the soliton center. This is the center of the 
fundamental loop that we describe by the soliton. The 
Mi, 2 £ [0,7r] are parameters. In the case of proteins the 
values of /ii 5 2 are entirely determined by the adjacent 
helices and strands. Far away from the soliton center we 
have 

/ii mod (2tt) i > s 
—ji2 mod (2ir) i < s 

For a- helices and /3-strands the /ii 2 values are deter- 



mined by (13), (14). Negative values of Kj'i are related to 



the positive values by (12). 



The Ni and N 2 constitute the integer parts of 
and for simplicity we shall take ~N\ = N 2 = N. This in- 
teger is like a covering number, it determines how many 
times Ki covers the fundamental domain [0, ir] when we 
traverse the soliton once. We introduce this integer for 



the following reason: The Ansatz (29) is monotonic but 
in general the values of ^ G [0,7r] that we obtain from 
PDB are not. Whenever we encounter a site i where k\ 
in the PDB data fails to be monotonic, we either add or 
subtract 2ir to its value, to regain a monotonic data pro- 
file. This shift does not have any effect on the backbone 
geometry. In this manner we utilize the multi-valuedness 
of Ki to convert any sequence {^} into a monotonic one, 
that we can then approximate by the Ansatz (29). 

Note that for /ii = /i 2 and g\ = o 2 we recover the hy- 
perbolic tangent. In this case the two regular secondary 
structures before and after the loop are the same. More- 
over, only the (positive) g\ and cr 2 are intrinsically loop 
specific parameters. They specify the length of the loop 
and as in the case of the /ii 2 they are combinations of 



the parameters in (22) 



Similarly, in the case of the torsion angle there is only 



one loop specific parameter in (23). The overall, common 



scale of the four parameters is again irrelevant and two of 
the remaining three parameters are fixed by the regular 
secondary structures that are adjacent to the loop. 

Long protein loops, and entire super-secondary protein 
structures such as a helix-loop-helix can be constructed 



by combining together solitons (28), (23). A typical short 
super-secondary structure that is described by a single 
soliton involves at least 15-20 different amino acids, of- 
ten even more. As a consequence our DNLS equations 
and the explicit soliton Ansatz compute the 45-60 spa- 
tial coordinates of the ensuing C a carbons in terms of of 



the five essential universal parameters in (22). This im- 



plies that the DNLS equations comprises a highly under- 
determined system of equations, and the key physical 
principles of our approach are experimentally testable. 
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In [21] it has been shown using the Ansatz (29) that 
over 92% of PDB configurations can be constructed in 
terms of 200 explicit soliton profiles. This makes a strong 
case that the solitons of the DNLS equation are the mod- 
ular building blocks of folded proteins [2T] . 



E: Side-chain energy function 

In order to account for the C/? contribution to the pro- 
tein free energy, we augment (22) by terms that involve 



the variables (0i,ipi) in (17). We shall assume that side- 



chain directions are locally slaved to the backbone. By an 
explicit analysis of PDB structures using Frenet frames 
this can be confirmed to be the case [22 j, as shown in 
Figure 2. 

The C/3 latitude angles 0^ are gauge i.e. frame invari- 
ant: The 0i are entirely determined by the tangent vec- 
tors ti, and consequently can not depend on the choice 
of framing. To the leading order we may then assume 
that each 0i interacts locally, with the corresponding Ki 
only. The leading order contribution is obtained by Tay- 
lor expanding a general interaction potential around the 
(k,i dependent) equilibrium values of the 0^, 



1=1 k 



c e 2 



... (30) 



where the additional terms are of higher order in powers 
of Kj and 0j. From this we solve for 0», 



a e + bpK% 
c e + derf 



(31) 



we specify the unitary gauge, which amounts to selecting 
the Frenet framing along the backbone. 

As visible in Figure 2, the fluctuations in tpi are about 
as small as those in 0^. Moreover, in the combinations 



(32) the torsion angles T{ are determined locally by the 
bond angles according to (23). Consequently we may 
also Taylor expand the (pi contribution to the energy, 



and following (30) we conclude that the leading order 



contribution is of the form 



N 



■n 06 09 2 2 t 2 , 2 

E<p = / Kj<Pi - K K i Pi ~ °W + 2 

i=l ^ 



}♦■■■ 

(33) 



This slaves cpi to the backbone Ki according to 

** = Hri (34) 

Again only three of the four parameters are independent, 
the overall scale drops out. 



F: Total energy 



We combine (22), (30) and (33) to arrive at the total 
energy 



E — E K + E T + Eq + Eu 



(35) 



N-l 

E 

i=i 



N 



i=l 



2 Ki +1 Ki + yl2Ki + q- (/< - m 1 ) 1 } (36) 



Again, as in the case of (23), the overall scale cancels 



which leaves us with only three independent parameters. 
As visible from Figure 2, the fluctuations in 0^ around 
the average value (18) are small. From this Figure we 
also learn [22] that these fluctuations are slaved to the 
backbone geometry, which is dictated by the ^. This 



confirms that the present approximation (30) is reason- 
able. 

Unlike 0$, the longitude angle tpi does not remain 
intact under the frame rotation ([8| but transforms ac- 
cording to ( [l9| ). Consequently it can form a SO (2) frame 
(i.e. gauge) invariant combination with the backbone 
torsion angle (J5|. Two examples of gauge invariant com- 
binations are 



and 



k=l 



(32) 



We prefer to proceed with the second one, it is local in 



ifi. (The first is a difference of the second.) As in (22) (34). 



N 

E 

i=i 



N 

E 



N 

E 



d T 

y 



2 2 



de 



}0l - be^Oi - a e 0i 



2 2 



00 a 2 
~2 di 



(37) 



(38) 



+ ... (39) 



Since the variations in (0i,ipi) are much smaller than 
those in r^, the ensuing contributions Eq and are also 



much smaller than E T . Furthermore, according to (39) 
the backbone torsion angles and the side-chain angles 
(0i,ipi) are all slaved to the backbone bond angles K{. 
As a consequence the DNLS equation (28) is the Mas- 



ter Equation that entirely determines the geometry of a 
folded protein: The C a — C^ backbone is constructed by 
first solving for ^. The remaining two angles (0^, ipi) are 
then constructed in terms of the K{ using (23), (31) and 
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G: Parameters 



The energy function (35)- (39) involves a number of pa- 
rameters. Eventually, we would like to compute their 
numerical values directly from the amino acid sequence. 
But at the moment this has not yet been achieved. 

A priori it appears that the number of parameters 



needed in (35)- (39) to describe an entire protein back- 



bone, could be quite large. However, due to the presence 
of the dark soliton the number of parameters is actu- 
ally remarkably small: For each super-secondary struc- 
ture such as a helix-loop-helix, whenever the loop can be 



described in terms of a single soliton solution to ( 28 ) , the 



potential (25) has only four independent parameter com- 



binations. In addition of q and m that characterize the 
second and third term, there are only two essential pa- 
rameters in the first term. Three of these four parameters 
can be given the following interpretations. Two of them 
determine the values of K{ in the ground states such as 



( 13 ), (14) that are located along the backbone before and 
after the soliton i.e. the type of the helix that precedes 
and follows the soliton. The third parameter determines 
the length of the loop. The fourth parameter can then be 
included as one of the three independent parameters in 



the torsion profile (23). It can be attributed to the length 



of the soliton, in terms of torsion. In addition, in the soli- 
ton profile of k\ there is the parameter that specifies the 
position of the soliton along the backbone. This is an 
additional parameter that emerges from the periodicity 
of the lattice structure (lattice translation invariance). 



Since the overall scale in (23) cancels out, the two re- 



maining parameter combinations in addition of the loop 
length, become determined by the values of in the 
ground states surrounding the soliton i.e. the type of 
the helix as in (131), (14). 



In this way we arrive at the conclusion that for the 
backbone, the only loop specific parameters are those 
that determine the lengths of the solitons. All additional 
parameters in the energy function determine the regular 
secondary structures such as (13) and (14). The pro- 



files of all loops are completely fixed by the unique dark 



solution solution to (28). 



Similarly, we conclude from (31 ), (34) that in the equa- 



tions that determine the Frenet frame orientations of the 
C/3 carbons, there is only one loop specific parameter in 
both Oi and cpi. In each equation, the overall scale fac- 
tor cancels out and the values of the additional two in- 
dependent parameters are fully specified by the regular 
secondary structures adjacent to the loop. 



Since (35)-(39) aims to predict the 45-60 space coordi- 



nate of C a , and the corresponding 30-45 directional co- 
ordinates of the C/3 in terms of 11 essential parameters, 
we have a highly underdetermined system of equations. 
This implies that the physical principles of our approach 
are experimentally testable. 



In [21]! we have found that most crystallographic pro- 
tein structures in PDB can be described in a modu- 
lar fashion and with experimental B-factor precision, by 
combining together no more than 200 explicit soliton pro- 
files. We propose that by learning how to compute the 
parameter values directly from the sequence, the geo- 
metric shape of most folded proteins can be constructed 



simply by solving the Master equation (28). 



H: Fluctuations around solitons 



As such, the equations (28), (23), (31), (34) describe 



the critical points of the energy function (35). This en- 
ergy function should be duly interpreted as describing 
the effective long distance limit of the full microscopic 
second-quantized Schrodinger operator. As such, (35) 



then relates to proteins in the sense of a semi-classical 
approach. This kind of semi-classical description is com- 
mon in quantum field theories. There, it is often boldly 
used to describe phenomena at length scales that are sev- 
eral orders of magnitude smaller than anything which 
may have any direct relevance to proteins. We now wish 
to estimate the short distance scale, at which we expect 



the semiclassical approach to proteins based on (35), to 



break down due to quantum mechanical zero-point fluc- 
tuations. 



The backbone profile (28), (23) describes the C a lat 



tice in the limit where thermal fluctuations vanish. But 
even near zero temperature the protein remains subject 
to residual zero-point fluctuations that can not always 
be ignored. It is difficult to estimate and even harder 
to accurately calculate the amplitude of these zero-point 
fluctuations. For a realistic order of magnitude estimate 
we inspect the distribution of the B-factors that char- 
acterize experimental uncertainties in PDB data. We 
use all those PDB structures where the crystallographic 
measurements have been made at temperatures less than 
50K. The result is displayed in Figure 3. It shows that for 
the C a carbons the zero point fluctuations have a lower 
bound which is in the vicinity of 0.15 A. Consequently 
we estimate that the precision of our semi-classical ap- 
proach can at best be of the same order of magnitude. We 
account for these zero point fluctuations by dressing our 
(semi) classical backbone profiles with a tubular dominion 
that has a radius of 0.15 A. In particular, this tubular do- 
minion accounts for the bond length fluctuations, around 



the average distance (15) between the neighboring C a 



carbons. As can be seen from Figure 1, the fluctuations 
in the C a — C a distances are normally within range of 
0.15 A. 
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FIG. 3: The number of entries in PDB with temperature 
below 50K vs. Debye- Waller fluctuation distance. Note the 
logarithmic scale. 



Ill: A-REPRESSOR PROTEIN AS A 
MULTISOLITON CONFIGURATION 

A: Loop spectrum of 1LMB 

We start our soliton-based investigation of the A- 
repressor protein by analyzing its (fti,r^) spectrum. This 
will identify the putative soliton content. We use the 
first chain of the homo-dimer with the PDB code 1LMB. 
The structure is conventionally interpreted as a four loop 
configuration, and the second loop is the DNA binding 
one. 

From the PDB file we read the C a coordinates. We 
compute the tangent vectors from ([!]) and the binormal 
vectors from (|2|, and the bond and torsion angles from 
Q and ([5]). We construct these angles using the standard 
convention that Ki G [0,7r]. We locate the regions where 
the torsion angles are subject to large fluctuations. In 
these regions we judiciously implement the transforma- 
tion (12). This identifies the soliton structures in the 



loops. Both the motivation and the technical details of 
the soliton identification procedure are described in [TO] 
and [13] . 

In the left hand side of Figure 4 we show the (fti,T^) 
spectra that we obtain from the PDB data using the stan- 
dard differential geometric convention that curvature is 
positive Ki > 0. Each of the four figures describes the 
spectra over one of the four loops of 1LMB, as they are 
identified in PDB. We observe that in each Figure 4 on 
the left hand side, there is a region where the torsion 
angle T{ fluctuates rapidly between positive values and 
negative values. On general grounds [10], [13] a region 
where the torsion angle is solely positive and only subject 
to small variations is a putative regular secondary struc- 
ture. On the other hand, a region where the torsion angle 
fluctuates between positive and negative values is an in- 
dicative of an inflection point [13] , this kind of fluctuation 
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FIG. 4: (Color online) On the left column, we show the (ft;, n) 
spectrum for the four PDB loop structures of 1LMB, that we 
compute with the prevailing differential geometric convention 
that curvature is a nonnegative quantity. Black is the bond 
angle ft;, grey (red) is the torsion angle t;. On the right we 
display the corresponding spectra after we have identified the 
soliton profiles in ft;, using (12). All PDB loop structures 
except for the second, display the profile of a soliton pair. 
Only the second PDB loop can be identified as a single soliton 
state. 



suggests the presence of solitons. By judiciously apply- 



ing the gauge transformation ( 12 ) in the regions where T{ 
fluctuates between positive and negative values, we find 
that the Ki profiles in each of the left hand side Figure 4 
indeed describe solitons: Comparing with (13), (14) we 



conclude that the first loop structure in the PDB data is a 
pair of solitons separated by a short a-helix. The second 
PDB loop is a single soliton that interpolates between 
two a helices. The third loop structure in the PDB data 
is a pair of solitons separated by a short /3-strand. Fi- 
nally, the fourth PDB loop is a pair of solitons, separated 
by a short a-helix. 

Notice that our spectral analysis based structure iden- 
tification is a refinement of the conventional one, which 
utilizes techniques such as the presence or absence of hy- 
drogen bonds to conclude whether a site is part of a reg- 
ular secondary structure or part of a loop. Consequently 
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very short helical structures that become clearly visible 
in our (^,r^) spectral analysis, can be interpreted differ- 
ently in more conventional approaches. 



B: Soliton Ansatz 



(23) 



We proceed to evaluate the parameters in (29) 
that give our best fit to those seven soliton profiles, iden- 
tified by analyzing the (^,r^) spectrum. In Table 1 we 



TABLE I: Our best parameter values for each of the seven 
solitons in Figure 2. Note that mi, m2 are determined 
mod (2tt). 



Soliton 


Cl 


c 2 


mi 


7712 


(8,30) 


2.00441 


1.99595 


26.65124 


26.68412 


(26,39) 


2.94889 


2.95201 


70.67882 


70.60369 


(36,47) 


2.89729 


2.90755 


39.27387 


39.22546 


(46,59) 


2.97927 


3.00015 


1.07948 


1.52942 


(56,65) 


2.96486 


2.97087 


26.69087 


26.25280 


(62,74) 


2.94948 


2.94547 


20.43071 


20.38220 


(74,87) 


2.89725 


2.89945 


89.50870 


89.55252 



Soliton 


s a/b d/b 


(8,30) 


24.50259 -9.9921 • 10" 2 4.2191 • 10" 5 


(26,39) 


30.49642 -1.5114 • 10" 7 1.0662 • 10" 11 


(36,47) 


41.39325 -5.3794 • 10~ 7 7.4566 • 10 -11 


(46,59) 


53.67225 5.1477 • 10~ 7 -5.1529 • 10" 7 


(56,65) 


57.85123 -9.62942 • 10" 8 1.45097 • 10~ 12 


(62,74) 


70.22069 -9.27151 • 10" 7 3.05202 • 10" 10 


(74,87) 


75.56315 -7.13705 • 10~ 7 1.8457 • 10" 11 



present our best parameter values. The parameters can 
be computed by various different techniques. Here we 
have chosen a Monte Carlo search that is fast and gives 
very good accuracy. In Table 1 we also identify the corre- 
sponding super-secondary structures by their PDB back- 
bone indices. Note that since our approach is based on 
the spectral analysis of bond and torsion angles and since 
the definition of a bond angle involves three sites while 
that of the torsion angle involves four, three residue in- 
dices at both ends of the 1LMB chain are absent in the 
(K>i,Ti) list. Notice also that we have introduced some 
overlap between the different super-secondary structures. 
This ensures that we can eventually combine them to- 
gether into the full 1LMB backbone. Moreover, in the 
case of all except the fourth soliton, we have utilized the 
multi-valuedness in the definition of the bond angle to 
extend the range of its values outside of the fundamen- 
tal domain Ki G [0,7r]. This corresponds to selecting 
non- vanishing integers iVi, N 2 in the Ansatz (29). For 
simplicity we have limited our Monte Carlo search to the 
symmetric case N\ = A^, but for better accuracy the 



soliton profiles could also be given asymmetric integer 
parts. The numerical values of the bond angles ^ that 
we compute from ( 29 ) using the parameter values in Ta- 
ble I, are to be reduced onto the fundamental domain 
[0, 7r] using the mod(27r) multivaluedness. 

We recall from Section II D, that the introduction of 
non- vanishing values of Ni and N 2 enables us to account 
for the non-monotonic profile that the bond angles of 
the PDB configuration display when restricted into the 
fundamental domain: At each point where the profile of 
the PDB data becomes non-monotonic, we simply add 
(or subtract) 2ir until we obtain a monotonic profile. In 
this manner, by allowing the bond angle to take values 
over a larger domain, the k\ profile of the PDB data over 
each soliton can be made monotonic which improves the 



accuracy of the fitted soliton profile (29). See Figure 5 



where we display the second PDB loop together with its 



approximation by the Ansatz (29) in the extended range, 
as an example. 
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FIG. 5: The data points correspond to the second loop of 
1LMB, after we have properly extended the range of the bond 
angles, by using the inherent multivaluedness of these angles. 
The interpolating function describes the Ansatz ( 29 ) with the 
parameter values given in Table I. 



We compute the torsion angles from (23), before 
implementing the 2tt reduction of the bond angles. We 
then reduce the ensuing values of the Ti into the fun- 
damental domain T{ G (— 7r,7r] using the 2tt periodicity. 
The underlying multivaluedness entirely accounts for the 
fluctuations in the profile. 

In Figure 6 we compare the explicit backbone profiles 
that we have computed from (29), ([23]), ([6|, ^ using 
the parameter values in Table I, with the 1LMB data in 
PDB. We find that for each soliton, our backbone pro- 
files describe the structural motifs of 1LMB with a pre- 
cision that is substantially better than the experimental 
precision which is determined by B-factors. This persist 
even when we account for the 0.15 A estimate of the 
zero point fluctuations around our solitons. With these 
highly precise soliton profiles we can unambiguously con- 
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FIG. 6: (Color online) The distance between the PDB back- 
bone of the first 1LMB chain and its approximation by the 
seven solitons |29| , ( [23] ) as a function of the residue number. 
The black line denotes the distance between the soliton and 
the corresponding PDB configuration, the grey area around 
the black line describes the estimated 0.15 A zero point fluc- 
tuation distance around solitons, obtained from Figure 3. The 
grey (red) line denotes the Debye- Waller distance that is com- 
puted form the B-factors in PDB. 



elude that 1LMB has a total of seven solitons. Two of 
the a-helices and the sole /3-strand are so short that the 
ensuing soliton pairs are interpreted as single loops in the 
conventional, hydrogen bond based analysis of the PDB 
data. The only motif where our construction identifies a 
single PDB loop as a single isolated soliton, is the DNA 
binding one. All of the remaining three loops consist of a 
pair of solitons with profile ( [29] ) ? ( [23] ) that are separated 
from each other either by a very short a-helix in case 
of the residues (23,33) and (69,90), or by a very short 
/3-strand in case of residues (51,61). 

We have performed a statistical analysis on the oc- 
currence of our seven solitons in all PDB proteins. In 
Table 2 we list the number of matches that each of these 
solitons has when we search PDB for configurations that 
deviate from the given soliton by an overall root mean 
square distance (RMSD) which is less than 0.5 A. We 
have chosen this cut-off value since it is representative 
of the Debye- Waller fluctuation distance in the experi- 
mental 1LMB data; see Figure 6. The interesting obser- 
vation is that the second soliton of 1LMB that precedes 



TABLE II: The (minimal) soliton sites that we have used 
in searching for matching structures in PDB, together with 
the number of matches. The search is limited to those x- 
ray structures that have a resolution better than 2.0 A and a 
match is a configuration that deviates less than 0.5 A in total 
root mean square distance (RMSD) from the soliton. 



Soliton 


1 


2 


3 


4 


Sites 


(20,28) 


(27,36) 


(36,46) 


(50,58) 


Matches 


9601 


4 


810 


159 



Soliton 


5 6 7 


Sites 


(55,63) (66,75) (74,82) 


Matches 


1552 1342 406 



the DNA recognition helix, is unique to the A-repressor 
protein. The only matching structures are located in the 
different PDB entries of the A-repressor protein itself. We 
also note that for this soliton the B-factors are slightly 
higher than for any of the other six solitons along the 
1LMB backbone. 

Finally, by following [12] we have combined the seven 
solitons together to describe the sites 8-90 (PDB index- 
ing) of the entire 1LMB backbone. The result is shown 
in Figure 7. The overall RMSD accuracy that we get in 




FIG. 7: (Color online) The PDB backbone configuration of 
1LMB (dark red) and its multi-soliton (light green) approxi- 
mation, for sites 8-90. The total RMSD distance is 0.45 A. 



this manner is 0.45 A. This could be further improved by 
adjusting the parameters while combining the solitons. 
But as such, the accuracy displayed by the configuration 
in Figure 7 is below the experimental B-factors. Con- 
sequently any improvement is senseless, in light of the 
quality of the experimental data. 



11 



C: Soliton solution and 1LMB 

We proceed to construct the parameters corresponding 
to the 1LMB backbone, for both the C a -atoms and the 



side-chain C/3-atoms, using the energy function (35). We 
first solve (28) to obtain the bond angle i.e. ^-profile. 
We then construct the backbone torsion angles and the 
side-chain (0i,ipi) angles from (23), (31), (34). We con- 



struct a single solution of (28), that describes the entire 
chain. 

For simplicity of construction, we restrict the values of 
K{ into the fundamental domain K{ G [0, 7r]. As in the case 
of the Ansatz (29), a higher precision could be obtained 
by allowing to take values beyond the fundamental 



domain. But with (35 ), it turns out that we can reach the 



B-factor accuracy by constructing the solution of (28), 



using the fundamental domain only. This is because the 



solution of ( 28 ) is even better suited for modeling proteins 



than the Ansatz (29). 



To construct the parameters that describe the back- 
bone K{ profile of 1LMB, we use the iterative learning 
algorithm of [TT] . It determines the parameters by locat- 
ing the fixed point of 



2k, 



(n) 



» 



'.)} 

(40) 

that minimize the RMSD distance to 1LMB. Here 
{ft- n ^}i G 7V denotes the n th iteration of an initial configu- 
ration {ft-^j^Tv and e is some sufficiently small but oth- 
erwise arbitrary numerical constant. We select e = 0.01. 
A fixed point of (Eol) clearly satisfies the DNLS equation 



(28). Following [TT] we utilize step- functions to construct 
an initial configuration for K{. The ensuing initial pro- 
file of is chosen to have the same overall profile as 
the properly gauge transformed 1LMB that we display in 
Figure 4 right hand side column. A Monte Carlo routine 
is set up to determine the parameters. For this we have 
developed a package that we call Propro [24]. It imple- 
ments our parameter learning algorithm for a given pro- 
tein structure, largely automatizing the entire process. 

In Figure 8 we show a Propro screen capture of the Ki 
profile that describes the final multi-soliton solution that 
yields the shortest overall RMSD distance between the 



solution to (40) and the 1LMB structure, for the back- 
bone C a carbons. In Figure 9 we show the corresponding 



Ti profile, computed from (23). The backbone C a RMSD 



distance between out multi-soliton solution and 1LMB is 
0.52 A. This is slightly larger that what we obtained with 



the Ansatz ( 29 ) . But this time we have not extended the 
values of ^ outside of the range ^ G [— 7r,7r]. In Figure 
10 we display the distance between the 1LMB and the 



soliton solution to (28), (23) for the individual C a atoms. 



For the most part the distance between our multi-soliton 
solution and 1LMB is below the Debye- Waller fluctua- 
tion distance. The only real exception is located at the 
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FIG. 8: (Color online) The multi-soliton solution to (40 ) (line) 
and the PDB values of 1LMB (dots) along the C a backbone, 
for the backbone bond angles 




FIG. 9: (Color online) The profile of torsion angle computed 
from (23) (line) and the corresponding PDB values of 1LMB 
(red dots) along the C Q backbone. 



site 31, where the distance between 1LMB C a carbon 
and the soliton solution is close to 1.6 A. This site is lo- 
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FIG. 10: (Color online) The distance between the 1LMB back- 
bone and our multi-soliton configuration, constructed by solv- 
ing ( 28 ) (black line) . The grey shaded area around the black 
line describes the estimated 0.15 A zero point fluctuation dis- 
tance around the multi-soliton solution. The grey (red) line 
describes the experimentally measured Debye- Waller fluctua- 
tion distance 
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cated at the second soliton. We recall that this soliton 
is unique for 1LMB (see Table II) and that it was also 
singled out by the Ansatz (29). Our analysis indicates 
that something takes place at this soliton that warrants 
a more careful experimental analysis. The properties of 
this soliton might have a role in the transition from lyso- 
genic to lytic state. 

We proceed to extend our multi-soliton to describe 
both the positions of the C a carbons, and the directions 



of the Cp carbons. For this we use (31) and (34). The fi- 
nal configuration has a combined C a - Cp distance of 0.6 
A to 1LMB. In Fi gure 11 we compare the corresponding 
structures, the 1LMB backbone is translucent. 




FIG. 11: (Color online) Comparison between the C a - Cp 
multi-soliton solution (opaque, with colors online) and the 
1LMB structure (translucent, grey online). The RMSD dis- 
tance is 0.6 A. 




FIG. 12: (Color online) Close-up of Figure 11 around site 31, 
the location of the second soliton. Multi-soliton is opaque, 
the PDB structure of 1LMB is translucent. 



nation with the requirement that the energy must be in- 
dependent of the coordinate frame where it is computed. 
Consequently, by construction, our energy function cor- 
rectly describes the leading order long distance contri- 
bution to any energy function that is grounded on more 
detailed atomic level considerations. All the variables 
and interactions that are less relevant for the description 
of the C a geometry, are accounted for effectively through 
the functional form and the parameter values of the in- 



dividual contributions to (36), (37). 



We shall try and approach protein dynamics in the 
same universal manner. We average over all very short 
time scale oscillations, vibrations and other tiny fluctu- 
ations in the positions individual atoms that are basi- 
cally irrelevant to the way how the folding progresses 
over those time scales that are biologically relevant. The 
general concept of universality [H]-[T7] proposes us to 
adopt a Markovian Monte Carlo time evolution with the 
following universal, coarse grained heat bath probability 
distribution 1251. l26l. [271 



x AE 
— with x = ex P{-"^} ( 41 ) 



In Figure 12 we have a close-up of the region around 
PDB site 31, where the difference between the multi- 
soliton solution and the 1LMB configuration is largest. 



Finally, in Table III we list the parameters in (35)-(39), 
for all the seven solitons. 



IV: COLLAPSE STUDIES OF 1LMB 



In the backbone energy (36), (37) we have retained 



only those variables that are relevant to our description 



of the C a geometry. The derivation of (K36j), (37) is based 



on the general concept of universality [14] -[17], in combi- 



Here AE is the energy difference between consecutive 
MC time steps, that we compute from (36), (37). We 



select the numerical value of the temperature factor kT 
so that the model describes the appropriate phase. In [28] 



it has been shown that (36), (37) is capable of describing 
the three phases of polymers. At low values of kT we 
are in the phase of collapsed proteins. As the value of 
kT increases and reaches the O-point value, there is a 
transition to random coil phase. When the temperature 
reaches even higher values, there is a cross-over to self- 
avoiding random walk. 

It turns out that in the collapsed phase, below the 6- 
point temperature, the universal aspects of folding dy- 
namics are quite independent of the numerical value of 
kT. For concreteness, we perform our simulations in the 
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collapsed using the value 

kT = 1(T 15 

Note that the overall normalization of kT can always be 
changed by an overall normalization of the parameters in 
Table III. 

We shall assume that during the folding process there 
are no re-arrangements in the backbone covalent bond 
structure, such as chain crossings. For this we introduce a 
self-avoidance condition that keeps the distance between 
any two backbone C a atoms at least as large as the length 
of a typical van der Waals radius which is around ~1.3 
Angstrom. 



Note that we do not propose that (41) is capable of 



describing the atomic level dynamics of the folding pro- 
cess. Such minuscule details are highly sensitive to the 
initial atomic configuration. A detailed knowledge of the 
time evolution of a particular atom during the collapse 
can hardly have any practical relevance for the under- 
lying physical principles and phenomena. Thus, for the 
purpose of conceptually understanding the temporal evo- 
lution of a protein towards its native conformation, the 



dynamics described by (41) is sufficient. We argue that 



the combination of (35), (41) correctly captures the uni- 



versal statistical aspects of protein collapse over the bio- 
logically relevant temporal and spatial scales. 



A: Antiferromagnetism and folding nuclei 

In our approach, proteins have a modular structure. 
A folded protein is built by combining together solitons 



of the discrete non-linear Schrodinger equation (28), one 



after another. From the point of view of the energy func- 
tion (36), (37), a uniform helical configuration is one with 



and with the value of ti computed from (23) this is a 



ground state of the energy. In particular, a straight linear 
rod is a special case, it is a ground state when m = 0. 

As usual, the physical principles that give rise to pro- 
tein folding are best analyzed in the absence of other 
processes and interfering agents. For this reason, in the 
present sub-section, we study the soliton formation along 
a helical backbone, by inspecting how the folded config- 
uration builds from the ground state of the energy. Con- 
sequently we use a straight helical structure as our initial 
configuration. In the case of the A-repressor protein we 
are particularly interested in the formation of the third, 
DNA binding soliton. 

The parameter values in Table III are uniform over 
each putative super- secondary structure. In particular, 
there is no information on the loop locations. In a pro- 
tein, the placement of a loop often correlates with the po- 
sition of certain amino acids, such as proline and glycine, 



that act as folding nuclei. In order to model a folding 
nucleus we introduce a transient parameter a that sends 



the first term in (36) into 



N-1 



^2 2 ^ 



+ 1^2 



N-1 

£ 

i=l 



(2(7 - 1) • 2« i+ i«i 



(42) 



Initially, we set a = for all i except at the links (z, i + 
1) along which the putative soliton centers are located. 
At these links we start with a = +1. This corresponds 
to an anti- ferromagnetic i.e. repulsive nearest neighbor 
interaction between the ensuing two sites. During the 
early stages of the simulation, we decrease the values of 
a so that after some number of steps we reach the uniform 
final value a = for all links along the entire chain. 

For each super-secondary structure the value of m 
in Table III determines the regular secondary structure 
which is located either before or after the corresponding 



loop, and the value of the parameter q in (36) determines 



the propensity of this structure to form. The stability of 
a helices and f3 strands is due to hydrogen bonds that 
form during the collapse, and consequently the value of 
q can be interpreted as a measure of the strength of hy- 
drogen bond interactions. In our simulations we wish 
to start from an initial configuration with no initial hy- 
drogen bonds. We conform to this by setting all q = 
initially. We then switch on the hydrogen bond inter- 
actions by increasing the values of the parameters q to 
those given in Table III. We find that this stabilizes the 
regular secondary structures. 

We wish to investigate the effects that the temporal 
ordering of loop formation has on folding and on misfold- 
ing. For this we compare different orderings in removing 
g and in switching on the values of the q. 

In the case of the lysogeny maintaining A-repressor 
protein we have considered various scenarios to conclude 
that there is the following general pattern: The seven 
solitons that we display in Figure 4 (right column) tend 
to form as pairs, with (2,3), (4,5) and (6,7) each a soliton- 
soliton pair. The first soliton is also made with a pair. 
But after the formation, the pair of this soliton moves 
away and disappears through the A/"-terminal of the back- 
bone. The alternative, where the first and second soliton 
form as a pair seems to give rise to a misfolded state that 
furthermore appears to be unstructured. We now de- 
scribe in detail two generic examples that illustrate this 
general pattern: 

First example: In our first generic example we start 
from a uniform, straight helical structure. In the Fig- 
ure 13a we show the initial K{ profile. We note that 
there is substantial latitude in choosing the initial val- 
ues of Ki and Ti. To begin with, we also set all q = 
so that there are no hydrogen bonds to stabilize the he- 
lical structure. All the remaining parameters have the 
values that are listed in Table III. We introduce an an- 
tiferromagnetic coupling a = +1 at links + 1) with 
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i = 16, 23, 33, 46, 49, 63, 67. This models the folding 
nuclei at the putative positions of the centers of the soli- 
tons. During the first part of the simulation, say during 
the first 2,000,000 Monte Carlo steps, we adiabatically 
remove the folding nuclei by linearly decreasing the val- 
ues of (7 until we reach the final ferromagnetic values 
a = at all sites. At the same time we turn on the hy- 
drogen bonds, by increasing the values of the couplings q 
from zero to the values given in Table III. After the first 
2,000,000 steps all parameters along the backbone then 
have the values shown in Table III. 

We remind that according to our observations there is 
a lot of latitude in details of the procedure. 

We find that the last four solitons form as the pairs 
(4,5) and (6,7). At the putative location of the third 
soliton, we also observe the formation of a soliton-soliton 
pair. But the soliton which is located closer to the TV- 
terminal moves towards left as shown in Figure 13, and 
becomes anchored at the site 23 where it forms the sec- 
ond soliton of 1LMB. A new soliton pair appears at the 
putative location of the first soliton, one of these solitons 
disappears through the TV-terminal and the entire back- 
bone stabilizes rapidly into the correct native state. See 
Figure 13. 

Second example: In this example we simulate a sce- 
nario where the first pair is formed before the third soli- 
ton. The initial configuration is a folded structure, with 
the fully formed soliton pairs (1,2), (4,5) and (6,7) i.e. 
these solitons have the same (^,7$) values as the 1LMB. 
But the DNA binding third soliton is absent and instead 
there is a helix extending from the second to the fourth 
soliton, see Figure 14 and 15. The simulation starts 
with an antiferromagnetic a = 1 at link i = 34, and with 
no hydrogen bond interactions i.e. q = between second 
and fourth solitons. There is then an initial production 
of a soliton-soliton pair as seen in Figure 15b and we find 
two possibilities: 

If the hydrogen bonds form slowly i.e. q grows to its 
final value slowly in comparison to the removal of a, we 
arrive at a final fold where there is an additional soli- 
ton (loop) around site 39. We display the bond angle 
profile in Figure 14c. The protein is now misfolded. On 
the other hand, if the hydrogen bonds are formed more 
rapidly, we arrive at a configuration where the third and 
fourth soliton annihilate each other. No soliton is formed, 
but due to steric restraints there is a slightly irregular 
helical region between sites 31 and 45. The final config- 
uration is shown in Figure 15. 

We conclude that if the (1,2) pair forms before the 
third soliton, the protein becomes misfolded into a state 
with more than one conformational substrate. We pro- 
pose that this could be confirmed experimentally. It 
might relate to the transition from the lysogenic to the 
lytic state in A-phage. 
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FIG. 13: (Color online) Time series of soliton (loop) forma- 
tion along the 1LMB backbone in our simulation. The black 
line shows the time evolution of the solitons, the grey (red) 
line is the PDB profile. Time increases from top down. In the 
first Figure from the top we have the initial configuration, a 
straight helical structure. The solitons 4-7 form as the two 
soliton pairs (4,5) and (6,7). At the position of soliton 3, there 
is also a pair formation. But the pair of soliton 3 propagates 
along the chain towards the TV-terminal, until it becomes an- 
chored at site i = 23 where it forms the second soliton. A 
new soliton pair forms at site 16 and one of these two soli- 
tons moves and disappears through the TV-terminal, leaving 
us with the first soliton and the correctly folded backbone 
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FIG. 14: (Color online) If hydrogen bonds form slowly, there 
is an extraneous soliton that disturbs binding between 1LMB 
and DNA, and probably no binding is possible. Grey line is 
the 1LMB profile, and black with red dots is the simulated 
profile. 
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FIG. 15: (Color online) If the hydrogen bonds are formed 
early, the backbone enters in an unstructured state where the 
3 soliton become mis-formed. Consequently there is no soliton 
that would dock the 1LMB with DNA. Grey line is the 1LMB 
profile, and black with red dots is the simulated profile. 



files in Figure 16 proposes the following mechanism for 
the lysogenic-lytic transition: Under lysogenic conditions 
where the A-repressor protein prevails, the soliton pairs 
of the A-repressor protein that are located immediately 
prior and after the DNA binding domain are relatively 
stable. But when there is a change in the environmen- 
tal conditions that excites phonon fluctuations along the 
protein chain such as raise in temperature or UV radia- 
tion, or maybe an enzymatic action that remains to be 
identified, either of these soliton pairs can discharge by 
a saddle-node bifurcation. This bifurcation disturbs the 
structure of the immediately adjacent DNA binding mo- 
tif to the extent that the protein looses its capability 
to maintain the lysogenic phase. Since each of the cor- 
responding motifs in the CRO protein are topologically 
more stable single soliton configurations, they are much 
more insensitive to effects such as local phonon excita- 
tions due to UV radiation and thermal effects, and con- 
sequently the lytic phase can take over, we display the 
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FIG. 16: (Color online) The resolved (Ki,n) spectrum for 
the first PDB helix-loop-helix of 1LMB (left) and the corre- 
sponding structure of in 20VG (right). The bond angle k is 
black, torsion angle r is grey (red). The bond angle spectra 
reveal that in 1LMB the loop is a bound state of two close- by 
solitons while in 20VG there is only one soliton profile. 



B: Comparison of A-repressor and CRO soliton 
structures 



In Figure 16 we compare the first two solitons in 1LMB 
to the first loop in the CRO regulator protein that con- 
trols the transition to the lytic state. For the latter, we 
use the PDB structure with code 20 VG. In the CRO 
protein, the first loop is topologically more stable, in the 
sense that it consists of a single soliton. As such the loop 
in 20VG is more stable than in 1LMB. For a plane curve, 
a single soliton can be made or deleted only by transport- 
ing it through one of the end points of the curve. On the 
other hand, a pair of solitons such as the one in the left 
hand side of Figure 16 is not topologically stable but can 
be more easily created or removed locally, by a saddle- 
node bifurcation that brings the two solitons together. 
This removes the corresponding loop by converting it (in 
this case) into a single long a-helix. 

A comparison between the A-repressor and CRO pro- 



first PDB helix-loop-helix motif and for comparison we 
display the corresponding structure in the CRO protein 
with PDB code 20 VG. In the case of A-repressor the mo- 
tif is clearly a bound state of two solitons while in the 
case of CRO we have a single isolated soliton. 



C: Heating and cooling in 1LMB 



We apply (41) to theoretically investigate what takes 



place in 1LMB when we heat it into the random coil 
phase, and then re-cool it back to the collapsed phase. 
According to Anfinsen [29] the protein should return to 
its original conformation. 

We start from the multi-soliton configuration that de- 
scribes the PDB structure 1LMB with parameter values 
given in Table III. Unlike in the previous subsection, dur- 
ing the entire heating and cooling cycle we now keep all 



16 



the parameter values intact. In particular, neither dur- 
ing the heating nor during the cooling do we introduce 
any transient antiferromagnetic parameter a as in the 
previous subsection. Nor do we change the parameter 
values q during the present process. As a consequence 
the position of any soliton and the size of any helical 
structure becomes determined dynamically, without any 
explicit folding nuclei. Moreover, since the parameter 
values q do not change, the strength of the underlying 
hydrogen bond interactions remains constant during the 
simulations. Any deformation or adjustment in the heli- 
cal structures will be entirely due to thermal fluctuations 
during the heating and cooling cycle. 



We introduce the heat bath dynamics (41). The tem- 
perature kT is assumed to be globally determined, and 
the heating and cooling should proceed slowly over the 
biologically relevant time scales. This ensures that the 
entire protein structure is kept at an equal temperature 
value so that we can ignore any effects due to local tem- 
perature variations. 

During heating and cooling cycle we follow the back- 
bone evolution by computing the root-mean-square dis- 
tance (RMSD) between the C a coordinates r i of the 
1LMB backbone and those of the mult i- soliton configu- 
ration Ti, as a function of the temperature 



RMSD(T) 



def 



(43) 



We start from a low temperature value that corresponds 
to the collapsed phase. We again choose the numerical 
value 



kT = 10" 



(44) 



for the initial configuration. We adiabatically increase 
the temperature kT to some high value during 500,000 
MC steps. We then keep the system at this high tem- 
perature during 1,000,000 MC steps, and finally cool it 



back to the original temperature value (44) during an- 



other 500.000 steps. The relatively large number of MC 
steps in our cycle at the high temperature ensures that 
the protein becomes fully thermalized to that tempera- 
ture value. 

We have made simulations with several hundreds of re- 
peated heating and cooling cycles, always starting with 



(44) and heating to different high temperature values 



that are well above the ©-point, where the protein enters 
the random walk phase. From Figure 17 we learn that 
as long as the high temperature value remains below 



kT 



ci 



10" 



(45) 



the protein always returns back to its original shape upon 
cooling. But between values in the range 




l °s 10 kT > 



FIG. 17: At temperature values log/cT < —4 the protein re- 
turns to its original conformation after the heating and cool- 
ing cycle. Between —4 < log kT < —2 a small fraction starts 
to misfiled. At log/cT w —2 there is a rapid transition, so 
that proteins that have been heated to higher temperatures 
become misfolded. 



there is a small fraction of cycles at the end of which 
the protein becomes misfolded. Finally at around the 
temperature value kTci there is a rapid cross-over and if 
the heating temperature exceeds 

T>T C2 

the final conformation is always misfolded. The misfold- 
ing is caused by deformation of one or more of the loops, 
often due to the wrong ordering in loop formation. 



In Figure 18 we show as an example, how (43) evolves 



in average during a typical heating cycle that reaches very 
near the critical temperature value (45). We find that 



during the heating there is a rapid increase in the RMSD 
distance. This is due to the transition to the random coil 
phase. When the system is kept at the high temperature, 
there are substantial thermal fluctuations. The shaded 
region (blue online) denotes the one standard deviation 
extent of these fluctuations. During the cooling we have 
a collapse transition, and at the end the value of (43) 



10" 



kT C i <kT < kT C 2 ~ 10" 



becomes small with practically no fluctuations, indicat- 
ing that the protein has returned to the original 1LMB 
like conformation. In Figure 19 we display a generic ran- 
dom coil structure that we observe during the heating 
phase. It has the look of a typical random coil with no 
regular, helical structure remaining. The structure is not 
static, but subject to very strong thermal fluctuations in 
its shape. 



SUMMARY 

In summary, we have investigated the structure and 
folding pathways of the lysogeny maintaining A-repressor 
protein. Our approach is based on an effective energy 
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FIG. 18: (Color online) The evolution of the RMSD value (43) 
between the 1LMB and the heated structure during one cycle. 
At the high temperature value which is here kT = 10 -4 , the 
RMSD is large and subject to large thermal fluctuations; the 
shaded area denotes the one standard deviation fluctuation 
regime in the values of (43). At the end of the cycle, the 
protein folds back to the conformation of 1LMB. 




FIG. 19: A generic configuration during the high temperature 
regime in the cycle of Figure 18. The structure is an apparent 
random coil with no regular structure, and its detailed form 
is subject to strong thermal fluctuations. 



function, that we have argued describes the small fluc- 
tuation limit of any atomic level energy function. Our 
justification of the energy function is entirely based on 
two very general physical principles: The concept of uni- 
versality originally introduced in the context of phase 
transitions and critical phenomena, and the demand that 
any physical quantity must be independent of the coordi- 
nate system that is used for its description. Using these 
two universal physical principles, we have shown that the 
long wavelength fluctuation limit of the energy function 
for both the backbone C a and side-chain Cp atoms is 
fully determined in an essentially unique fashion. 

The energy function we have derived, computes the G a 
and Cf3 conformation in terms of a soliton solution to a 
variant of the discrete non-linear Schrodinger equation as 
the modular component. The explicit form of the rele- 
vant dark soliton solution is not known to us, but we have 
found that an excellent approximation can be obtained 
by naively discretizing the known dark soliton of the con- 



tinuum non-linear Schrodinger equation. We are able to 
re-construct the entire backbone of the A-repressor pro- 
tein with a precision that compares and even exceeds the 
precision of the experimental crystallographic structure 
with PDB code 1LMB, when we determine the precision 
in terms of the Debye- Waller B-factor fluctuation dis- 
tance. The high precision of our theoretical description 
enables us to conclude that the second soliton solution 
that appears in our description of the 1LMB is unique to 
this protein, there are no other similar solitons in the en- 
tire Protein Data Bank. The remaining solitons including 
the DNA binding one are all ubiquitous in PDB. We have 
also investigated the corresponding soliton structure in 
the CRO protein that is responsible for the lytic phase, 
and found that this soliton appears more stable and is 
also commonly found in the PDB data. These observa- 
tions suggest that the transition between the lysogenic 
and lytic life-cycles could somehow relate to the very ex- 
ceptional structure of the second soliton in 1LMB. 

We have extended our energy function to describe the 
collapse dynamics of 1LMB. In line with the construc- 
tion of the energy function, we rely on the concept of 
universality to propose that at biologically relevant time 
scales the folding dynamics can be described in terms of 
Glauberian relaxation dynamics, with Markovian time 
evolution. In this way we have found that all solitons 
in 1LMB appear to form as soliton-soliton pair. In par- 
ticular, the second soliton with its exceptional structure 
forms as a pair with the DNA binding third soliton; The 
pair of the first soliton flows away and disappears through 
the TV-terminal of the protein. Moreover, if for some rea- 
son the formation of the second and third solitons is dis- 
rupted, for example if the second soliton forms by itself 
before the third soliton, we find that the third soliton 
can not form properly. It is either formed in combina- 
tion of an extra soliton, or then the protein enters in an 
unstructured conformation. The choice between these 
two alternatives is made by the strength of the hydrogen 
bond formation. 

A.N. thanks H. Frauenfelder and G. Petsko for com- 
munications and J. Aqvist for discussions. 
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TABLE III: Our best parameter values for the multi-soliton 
solution that models 1LMB. Notice that in line with (26), 



( 27 ) , the values of mi , rri2 imply that all the regular structures 
are a-helices except for the one separating solitons 4 and 5 
which is a short /3-strand. We also note that the overall scale 
of the parameters is fixed by the normalization of the first 
term in (36) which we have chosen for convenience. 



Soliton 


Qi 


Q2 


mi 


7712 


1 


1.12091 


1.87372 


1.55737 


1.50013 


2 


0.357906 


9.69166 


1.65666 


1.54119 


3 


0.260909 


6.14144 


1.68182 


1.54676 


4 


0.684119 


4.75578 


1.47243 


1.09234 


5 


5.15882 


6.77828 


1.07066 


1.53464 


6 


0.314503 


0.38534 


1.66164 


1.6235 


7 


0.947322 


0.624884 


1.58568 


1.51678 



Soliton 


a 


b 


c/2 


d/2 


1 


-1.0 


-32357500 


47.5340 


438750 


2 


-1.0 


-1.86689 


0.0413976 


0.00101952 


3 


-1.0 


-9.26604 


0.121919 


0.0271569 


4 


-1.0 


-7.51371 


0.013751 


0.0148377 


5 


-1.0 


-25.0125 


0.251803 


0.597751 


6 


-1.0 


-23.9299 


0.0181312 


0.0410994 


7 


-1.0 


7.9809 


0.000215793 


0.037091 



Soliton 


CLQ 


be • 10- 12 


ce 


de 




1 


1.24035 


-475.728 


1.0 


5.44473-10" 


-10 


2 


2.74724 


-198463 


1.0 


0.42667 




3 


1.32293 


-5.021439-10 8 


1.0 


3.74966-10" 


-6 


4 


1.34998 


-81.1641 


1.0 


6.59854-10" 


-9 


5 


1.38447 


-6629519 


1.0 


5.20625-10" 


-7 


6 


1.38293 


-6.28532 


1.0 


1.34992-10" 


-5 


7 


1.23869 


-177.296 


1.0 


7.6271-10" 


7 
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b,p 




dip • 10- 10 
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0.971374 


-6.7206-10" 10 


1.0 


7211.16 


2 


0.813254 


2.3582-10" 7 


1.0 


1481.99 
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0.771272 


2.74 -10" 6 


1.0 


-51.6611 


4 


0.616865 


1.2158-10" 11 


1.0 


5478 


5 


0.89315 


3.87-10" 9 


1.0 


1039.56 


6 


0.545154 


0.184472 


1.0 


1.2685 



7 



0.988183 3.12459-10 -9 1.0 2.09412 



